TITLE OF THE INVENTION

TRANSLATION APPARATUS AND METHOD 

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based upon and claims the 
benefit of priority from the prior Japanese Patent 
Application PH 2001-20195, filed on January 29, 
2001; the entire contents of which are incorporated 
herein by reference.

FIELD OF THE INVENTION 



The present invention relates to a translation
apparatus and a method for correctly translating a
headline in a newspaper article.

BACKGROUND OF THE INVENTION 

Recently, machine translation software has been
widely used to let a user read Web pages on the
Internet. For example, machine translation software
is used to read a Web page of online news that
reports trends in foreign countries in real time.
In general, a Web page of online news includes a
headline and an article body. The headline
represents a summary of the article body. The Web
page is written in a first language (for example,
English), and the machine translation software
automatically translates the Web page into a second
language (for example, Japanese). A user whose
native tongue is the second language reads the Web
page after the machine translation. In this case,
before reading the translated sentences of the
article body, the user often reads the translated
headline in order to decide whether to read the
article body. Accordingly, translation of the
headline is more important than translation of the
article body.

However, in such a news article (for example, an
English article), new proper nouns not registered in
a translation dictionary are often used, and the
style of the English article is unique. Accordingly,
machine translation is difficult. In particular, the
headline (the title of the article) is written in a
fragmentary style that assumes the background
knowledge of English-speaking people. Accordingly,
machine translation of the headline is extremely
difficult.

As mentioned above, the style of a news article
headline is unique and its machine translation is
quite difficult.

SUMMARY OF THE INVENTION



It is an object of the present invention to
provide a translation apparatus and a method to
correctly translate a headline in a news article.

According to the present invention, there is
provided a translation apparatus for translating
machine readable article information of a first
language including an article body and a headline as
a summary of the article body, comprising: a
decision unit configured to discriminately decide
the article body and the headline in the article
information; and a translation unit configured to
respectively translate the article body and the
headline into a second language based on the
decision result of said decision unit.

Further in accordance with the present
invention, there is also provided a translation
method for translating machine readable article
information of a first language including an article
body and a headline as a summary of the article body,
comprising: discriminately deciding the article body
and the headline in the article information; and
respectively translating the article body and the
headline into a second language based on the
decision result.

Further in accordance with the present
invention, there is also provided a computer program 
product, comprising: a computer readable program 
code embodied in said product for causing a computer 
to translate article information of a first language 
including an article body and a headline as a 
summary of the article body, said computer readable 
program code having: a first program code to 
discriminately decide the article body and the 
headline in the article information; and a second 
program code to respectively translate the article 
body and the headline into a second language based 
on the decision result. 



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram of the translation
apparatus according to one embodiment of the present
invention.

Fig. 2 is a flow chart of decision processing
of a preprocessing unit according to one embodiment
of the present invention.

Figs. 3A and 3B are flow charts of detailed
processing of S5 in Fig. 2.

Fig. 4 is a schematic diagram of a calculation
method of similarity degree between an article of
translation object and each of a plurality of stored
articles.

Figs. 5A and 5B are flow charts of a high speed
algorithm of similar article retrieval processing.

Fig. 6 is a flow chart of processing of a target
word information extraction unit according to one
embodiment of the present invention.

Fig. 7 is a flow chart of processing of a
phrase alignment processing unit according to one
embodiment of the present invention.

Fig. 8 is a flow chart of abbreviation
estimation processing of the phrase alignment
processing unit according to one embodiment of the
present invention.

Fig. 9 is a flow chart of information source
detection processing for a news article according to
one embodiment of the present invention.

Fig. 10 is a block diagram of the translation
apparatus according to another embodiment of the
present invention.

Fig. 11 is a schematic diagram of components of
the English-Japanese parallel corpus in Fig. 10.

Figs. 12A and 12B are schematic diagrams of
target word information in Japanese.

Fig. 13 is a schematic diagram of target words
in Japanese.

Figs. 14A-14D are schematic diagrams of target
word information in Japanese.

Figs. 15A and 15B are schematic diagrams of
target word information in Japanese.

Fig. 16 is a schematic diagram of target words
in Japanese.

Figs. 17A-17H are schematic diagrams of target
word information in Japanese.

Fig. 18 is a schematic diagram of target words
in Japanese.

Figs. 19A and 19B are schematic diagrams of
target word information in Japanese.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Hereinafter, various embodiments of the present
invention will be explained by referring to the
drawings. Fig. 1 is a block diagram of the
translation apparatus according to one embodiment of
the present invention. In Fig. 1, an apparatus for
translating an English article into a Japanese
article is shown as an example. However, the
present invention can be applied to translation
between any two languages.

In the translation apparatus shown in Fig. 1, a
headline part and an article body part are
respectively extracted from the news article, and
each part is accurately translated. To accomplish
this purpose, three components are prepared: a
component to improve translation accuracy by using
a translation method corresponding to the
classification of the news article; a component to
improve translation accuracy by correctly extracting
and translating a noun phrase including an
abbreviation; and a component to improve
translation accuracy by using a translation method
suitable for each of the headline and the article
body. These components can be used individually or
in any combination.

In Fig. 1, the translation apparatus includes a
recording unit such as a hard disk to store an
analysis dictionary 6, an English-Japanese parallel
corpus 7, and a translation dictionary 8; and
processing units, namely a preprocessing unit 1, a
similar article retrieval unit 2, a target word
information extraction unit 3, a phrase alignment
processing unit 4, and a translation processing unit
5. Each processing unit can be implemented as a
program.

First, electronic information of an English
article is input to the preprocessing unit 1. The
preprocessing unit 1 analyzes the English article as
a translation object and identifies the headline and
the article body in the English article. Fig. 2 is
a flow chart of the algorithm to identify the
headline and the article body in the preprocessing
unit 1. As an example in Fig. 2, the English article
of the translation object is a Web page of a news
site. In Fig. 2, the preprocessing unit 1 obtains
the URL (Uniform Resource Locator) of the Web page
of the translation object (S1), and decides whether
the Web page is registered as a news site based on
the URL (S2). If the Web page is registered as a
news site, the preprocessing unit 1 identifies the
headline and the article body in the English article
by using a decision algorithm corresponding to the
news site. Examples of registered URLs are as
follows.

"http://xxxxnews.xxxxx.com/headlines/ts/index.html"
"http://www.xxx.com/"
"http://www.newsxxx.com/"
"http://www.xxtimes.com/"
In this case, for each registered URL, a
decision algorithm of headline/article body
corresponding to each Web page is prepared. For
example, on a Web page in which the headline is
located between two tags <NYT_HEADLINE> and
</NYT_HEADLINE>, the position of the headline can be
decided by the two tags. Furthermore, on a Web page
in which the article body (lead part) is located
between two tags <NYT_SUMMARY> and </NYT_SUMMARY>,
the position of the article body can be decided.
Ordinarily, the arrangement of the headline and the
body part is prescribed for each news site.
Accordingly, the preprocessing unit 1 can
respectively extract the headline and the article
body by using the decision algorithm corresponding
to the prescribed arrangement. Even within the same
news site, Web pages whose URLs partly differ often
have different structures. In this case, a decision
algorithm of the headline/article body is registered
for each URL.
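As an illustration only, a per-site decision algorithm of this kind might be sketched as follows. The rule table, the example URL, and the function names are hypothetical; only the tag names are taken from the example above, and an actual decision algorithm may use rules other than tag pairs.

```python
import re

# Hypothetical rule table: registered URL prefix -> (headline tag, body tag).
# The URL is a placeholder; only the tag names come from the example above.
SITE_RULES = {
    "http://www.example.com/": ("NYT_HEADLINE", "NYT_SUMMARY"),
}


def text_between(page, tag):
    """Return the text between <tag> and </tag>, or None if the pair is absent."""
    pattern = r"<\s*{0}\s*>(.*?)<\s*/\s*{0}\s*>".format(re.escape(tag))
    match = re.search(pattern, page, re.S)
    return match.group(1).strip() if match else None


def extract_by_tags(url, page):
    """Apply the rule registered for the URL; return (headline, body) or None."""
    for prefix, (headline_tag, body_tag) in SITE_RULES.items():
        if url.startswith(prefix):
            return text_between(page, headline_tag), text_between(page, body_tag)
    return None  # URL not registered: no site-specific rule applies
```

Registering one rule per URL prefix mirrors the statement that a separate decision algorithm is registered for each URL.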

If the URL is not registered, the preprocessing
unit 1 decides whether the URL includes characters
likely to indicate a news site, such as "news" or
"press" (S4). If these characters are included in
the URL, the Web page is regarded as a news site. By
applying a general decision algorithm to the input
English article, the preprocessing unit 1 extracts a
headline and an article body. Figs. 3A and 3B are
flow charts of the decision algorithm of S5 in Fig.
2. With the decision algorithm of Figs. 3A and 3B,
decision/extraction of the headline and the article
body is possible even for a news site whose URL is
not registered.
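The choice between the site-specific and the general decision algorithm could be sketched as below. This is a minimal sketch: the registered URL is a placeholder, the function name and return labels are hypothetical, and the `None` result for a URL that is neither registered nor news-like is an assumption, since the text does not describe that case.

```python
# Placeholder registrations; real entries would be the registered news URLs.
REGISTERED_URLS = ("http://www.example.com/",)
NEWS_KEYWORDS = ("news", "press")


def choose_algorithm(url):
    """Return which decision algorithm applies to the URL (S2, S4)."""
    if url.startswith(REGISTERED_URLS):           # S2: registered news site
        return "site-specific"
    if any(keyword in url.lower() for keyword in NEWS_KEYWORDS):  # S4
        return "general"                          # handled by S5 (Figs. 3A, 3B)
    return None  # assumption: the text does not say what happens here
```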

First, the preprocessing unit 1 obtains
electronic information of a Web page of the
translation object (S11), deletes any non-display
parts such as script code from the Web page (S12),
extracts continuous character parts not including
tags from the Web page, and assigns the number of
character parts (units of words) to variable N
(S13). Next, the preprocessing unit 1 obtains tag
data prescribing display attributes of each
character part (S14). The preprocessing unit 1
assigns "1" to variable I (S15) and decides whether
"N" is smaller than "I" (S16). If "N" is smaller
than "I", the processing is completed. Otherwise,
the preprocessing unit 1 decides whether the
attribute of character part I is the same as an
attribute used for a headline (S17). For example,
the preprocessing unit 1 decides whether the
characters of the decision object are in bold type,
linked to another page, or displayed in a font
larger than that of other parts. The headline in an
article is ordinarily displayed in bold type or in
a font larger than the font of the article body,
and a predetermined tag is often used. Furthermore,
the headline is sometimes linked by HTML (Hyper
Text Markup Language) to a detail page containing
the article body. Accordingly, the preprocessing
unit 1 regards the decision of S17 as one standard
to decide the headline. However, bold type is often
used for a writer name and a date as well.
Accordingly, if the preprocessing unit 1 decides
that the attribute is one often used for the
headline, a second decision of S18 is further
executed. The preprocessing unit 1 decides whether
character part I of the decision object is of a
kind often used for a part other than the headline
(S18). For example, the preprocessing unit 1
decides whether the character part includes
"Written by ..." or "Photo by ...", or consists of
the numerical values of a date. Furthermore, the
preprocessing unit 1 decides a headline part by
utilizing a limit on the number of words for the
headline part. In short, the preprocessing unit 1
checks the number of words (S19, S20). For example,
even if the character part is in bold type or HTML
linked, the character part is usually not the
headline if it consists of only a few words.
Conversely, if the character part consists of too
many words, the character part is probably not the
headline. The preprocessing unit 1 identifies a
headline if the number of words is above three and
below ten (S19, S20, S21). Furthermore, if the
character part is decided to be a part other than
the headline (S17), the preprocessing unit 1 can
decide whether the character part is the article
body or another part by counting the number of
words (S22). In short, if the number of words is
above ten, the character part is decided to be the
article body (S23). If the number of words is below
ten, the character part is decided to be another
part (S24). Furthermore, if the character part is
decided to be a part other than the headline (S18,
S19), the preprocessing unit 1 decides whether the
character part is the article body or some other
part (S24). After the decision of S21, S23, or S24,
the preprocessing unit 1 increments I by "1" (S25)
and repeats the processing from S16 for the next
character part.

As for decision methods of the headline and the
article body, various methods can be considered.
For example, the headline is often positioned at
the head of a page or in the <HEAD> part of an HTML
document. By utilizing this position rule, the
headline part can be identified. In short, by
utilizing these various decision standards, the
headline and the article body can be identified,
although with a decision accuracy lower than that
of Fig. 2.
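The general decision algorithm described for Figs. 3A and 3B may be easier to follow as a short sketch. The function name, the keyword-argument interface, and the regular expression are hypothetical. Two details are assumptions where the text is not explicit: a part of exactly ten words without headline attributes is treated as article body, and a part rejected at S18 to S20 is treated as "other".

```python
import re

# Hypothetical pattern for text that is often bold but is not a headline (S18):
# a byline or photo credit, or a string made only of digits and date punctuation.
NON_HEADLINE = re.compile(r"^(written by|photo by)\b|^[\d\s/.,:-]+$", re.IGNORECASE)


def classify(text, bold=False, linked=False, large_font=False):
    """Classify one character part as 'headline', 'body', or 'other'."""
    n_words = len(text.split())
    if not (bold or linked or large_font):           # S17: no headline attribute
        return "body" if n_words >= 10 else "other"  # S22 to S24 (>= is assumed)
    if NON_HEADLINE.search(text.strip()):            # S18: byline, credit, date
        return "other"
    if 3 < n_words < 10:                             # S19, S20: word-count limits
        return "headline"                            # S21
    return "other"                                   # assumption, see lead-in


def classify_parts(parts):
    """parts: list of (text, attribute dict); returns a label for each part."""
    return [classify(text, **attrs) for text, attrs in parts]
```

A caller would pass the character parts in page order, which corresponds to the loop over I from 1 to N in S15, S16, and S25.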

If an English article as a translation object
is an SGML (Standard Generalized Markup Language) or
an XML (eXtensible Markup Language) document, the
preprocessing unit 1 can easily identify the
headline and the article body by referring to the
tag code. Furthermore, even if an English article
as a translation object is a word processor document
or a text document that does not include tag data or
attribute data, characters close to the head part of
the article that are not a writer's name, a place
name, or a date are identified as the headline of
the article, and characters following the headline
are identified as the article body. By utilizing
this heuristic, the headline can be automatically
decided to some extent. To identify the writer
name or the place name, morphological analysis may
be utilized. Furthermore, if the headline and the
article body cannot be identified, candidates of the
headline may be displayed, and the headline and the
article body may be decided by an indication input
from the user.

As shown in Fig. 1, the preprocessing result of
the preprocessing unit 1 is supplied to the similar
article retrieval unit 2, the phrase alignment
processing unit 4, and the translation processing
unit 5. In the present embodiment, by utilizing the
preprocessing result, translations can be
differentiated according to the classification of
the news article by the similar article retrieval
unit 2 and the target word information extraction
unit 3. In addition, a noun phrase can be correctly
extracted and translated by the phrase alignment
processing unit 4, and the headline and the article
body can each be suitably translated by the
translation processing unit 5.

The target words based on the classification of
the news article are obtained by the similar article
retrieval unit 2 and the target word information
extraction unit 3. First, by using a word vector as
a processing result of the preprocessing unit 1, the
similar article retrieval unit 2 retrieves an
article similar to the English article as a
translation object from the English articles of the
English-Japanese parallel corpus 7. The
English-Japanese parallel corpus 7 is a database
which registers each English article and the
corresponding translated (Japanese) article.
Desirably, the Japanese article is a good-quality
translation produced with assistance. An abridged
translation of the English article may be registered
as long as the extraction processing of target word
information (mentioned later) can be executed.

The analysis dictionary 6 stores, in
correspondence with one another, a headword of an
English word, a part of speech, the plural form, the
abbreviation form, and the conjugation form. This
information is utilized for the morphological
analysis processing of the similar article retrieval
unit 2, i.e., morphological analysis of the English
article of translation object and the English
articles stored in the English-Japanese parallel
corpus 7. The content of the analysis dictionary 6
duplicates that of the English-Japanese dictionary
of the translation dictionary 8. Accordingly, the
translation dictionary 8 can substitute for the
analysis dictionary 6.

The similar article retrieval unit 2 retrieves
an article similar to the English article of
translation object from the English-Japanese
parallel corpus 7 by the following steps (a) to (f).
Fig. 4 is a schematic diagram of the calculation
method of the similarity degree between the English
article of translation object and each of a
plurality of English articles stored in the
English-Japanese parallel corpus 7.

(a) The headline and the article body are
morphologically analyzed using the analysis
dictionary 6. Each word is extracted from the
headline and the article body.

(b) The appearance frequency of each word is
calculated in the article (the headline and the
article body). A vector is created for the article
in which each word stem is a dimension and the
frequency of the word is the value of that
dimension. An index of the dimension (each word) is
represented as "k" and the vector of the English
article is represented as "ek".

(c) For each English article in the
English-Japanese parallel corpus 7, the same
processing as steps (a) and (b) is executed. In
this case, an index of the article number is
represented as "j", an index of the dimension (each
word) is represented as "k", and the vector of each
article is represented as "Ejk".

(d) A similarity degree between the article of
translation object and each article in the
English-Japanese parallel corpus 7 is calculated as
an inner product between the two article vectors as
shown in Fig. 4. The similarity degree between the
English article of translation object and each
English article j in the English-Japanese parallel
corpus 7 is calculated by the following equation (1).



$$\cos(j) = \frac{\sum_{k=1} e_k E_{jk}}{\sqrt{\sum_{k=1} e_k^{2}} \times \sqrt{\sum_{k=1} E_{jk}^{2}}} \qquad (1)$$



(e) The pairs of English article and Japanese
(translation) article in the English-Japanese
parallel corpus 7 are sorted in descending order of
similarity degree. If a similarity degree is below
a threshold, the pair including the English article
of that similarity degree is excluded.

(f) A predetermined number of pairs of English
article and Japanese article are selected in
descending order of similarity degree, and output as
the similar articles.
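Steps (a) to (f) can be sketched compactly as follows. This is a minimal sketch under stated assumptions: simple lowercase word splitting stands in for the morphological analysis with the analysis dictionary 6, the similarity is computed as a normalised inner product (cosine), and the function names and sample data layout are hypothetical.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Steps (a)-(b): extract words and count their appearance frequency.
    Plain word splitting is used here instead of morphological analysis."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(e, E_j):
    """Step (d): inner product of two frequency vectors, normalised by length."""
    dot = sum(freq * E_j[word] for word, freq in e.items())
    norm = math.sqrt(sum(v * v for v in e.values())) * \
        math.sqrt(sum(v * v for v in E_j.values()))
    return dot / norm if norm else 0.0


def retrieve_similar(article, corpus, threshold, limit):
    """Steps (c)-(f): corpus is a list of (english, japanese) article pairs."""
    e = word_vector(article)
    scored = [(cosine(e, word_vector(en)), en, ja) for en, ja in corpus]  # (c), (d)
    scored = [item for item in scored if item[0] >= threshold]            # (e)
    scored.sort(key=lambda item: item[0], reverse=True)                   # (e)
    return scored[:limit]                                                 # (f)
```

Each result is a `(similarity, english, japanese)` tuple, so the Japanese article of every retrieved pair remains available for the later extraction of target word information.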

In short, the similar article retrieval unit 2
identifies an English article having a high
similarity degree in the English-Japanese parallel
corpus 7 as an article similar to the English
article as a translation object. This processing
(article alignment technique) of the similar article
retrieval unit 2 is disclosed in the following
references (1), (2), and (3), the contents of which
are herein incorporated by reference.

(1) Collier, N., Kumano, A., Hirakawa, H.
"English-Japanese news article alignment from the
internet using MT", Japan Soc. for AI annual meeting,
1998.

(2) Collier, N., Hirakawa, H., Kumano, A.
"Machine Translation vs Dictionary Term Translation
- a comparison for English-Japanese news article
alignment", COLING-ACL, 1998.

(3) Collier, N., Hirakawa, H., Kumano, A.
"Creating a noisy parallel corpus from newswire
articles using multi-lingual information retrieval",
Transactions of J. Soc. Information Processing, 1999.

The processing of step (c) may be executed in
advance and the processing result (the word vector
of each English article) may be stored in the
English-Japanese parallel corpus 7. In this case,
high speed processing can be executed, and the
necessary memory capacity can be reduced because the
English article body is not stored in the
English-Japanese parallel corpus 7.

In deciding the similarity degree, the similar
article retrieval unit 2 lowers the weights of
proper nouns, dates, and quantities. The retrieved
similar article is used for extracting the target
words. The retrieved similar article is not
necessarily an article related to the same affair
described in the English article as a translation
object. It is sufficient that the type of affair
(such as a fire, or a purchase of a company) of the
retrieved article is similar to that of the English
article as a translation object. In other words, it
is not necessary that information such as who, what,
where, and how, represented by proper nouns, dates,
and quantities in the retrieved article, be similar
to that of the English article as a translation
object. Accordingly, the weights of those words are
lowered in deciding the similarity degree. If these
weights were not lowered, a sufficient number of
similar articles could not be retrieved from the
English-Japanese parallel corpus 7, and the
extraction processing of target word information
(explained afterwards) could not be suitably
executed.

Furthermore, instead of word extraction by
morphological analysis at steps (a), (b), and (c),
the stem of each English word may be extracted by
using a heuristic rule called the "Porter algorithm"
and utilized as the word. This processing is called
"stemming" and can be executed at high speed without
the dictionary. The Porter algorithm is disclosed
in the following reference, the contents of which
are herein incorporated by reference.

(4) Porter, M.F., "An algorithm for suffix
stripping", Program 14 (3), July 1980, pp. 130-137.

Furthermore, the weights of proper nouns (words
that start with a capital letter), dates, and
quantities such as an amount of money are lowered
at steps (b) and (c). However, the weights of words
in the headline and the head paragraph (lead) part
of the article may be made larger than those of
words in the article body.
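The weighting described above might look like the following sketch. The weight values 0.1 and 2.0, the function names, and the crude tests for proper nouns and quantities are all assumptions chosen for illustration; the text only states that some weights are lowered and others may be raised. Note that the capital-letter test also catches ordinary words at the start of a sentence.

```python
from collections import Counter

LOW_WEIGHT = 0.1        # assumed weight for proper nouns, dates, quantities
HEADLINE_WEIGHT = 2.0   # assumed extra weight for headline and lead words


def is_low_information(token):
    """Crude test: starts with a capital letter, or contains a digit."""
    return token[0].isupper() or any(ch.isdigit() for ch in token)


def weighted_vector(headline, body):
    """Build a weighted word-frequency vector for one article."""
    vector = Counter()
    for text, section_weight in ((headline, HEADLINE_WEIGHT), (body, 1.0)):
        for raw in text.split():
            token = raw.strip(".,;:!?()'\"")
            if not token:
                continue
            weight = LOW_WEIGHT if is_low_information(token) else 1.0
            vector[token.lower()] += weight * section_weight
    return vector
```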

Figs. 5A and 5B are flow charts of a high speed
algorithm for the processing of steps (d), (e), and
(f). In the algorithm of Figs. 5A and 5B, while the
similarity degree of each article in the
English-Japanese parallel corpus 7 is calculated at
step (d), a predetermined number of articles having
the highest similarity degrees so far are stored and
updated each time. In this case, the memory
capacity necessary for processing is greatly reduced,
and high speed processing can be accomplished
without the sorting processing of step (e).

In Figs. 5A and 5B, assume that an upper limit
of the number of similar articles output from the
similar article retrieval unit 2 is N, a total
number of English articles in the English-Japanese
parallel corpus 7 is M, and a threshold of the
similarity degree is P. An arrangement "ARRAY" of
size N is prepared (S31). Next, a variable L (the
minimum similarity degree of the articles in ARRAY)
is set to "1" and a variable K (the number of
articles in ARRAY) is set to "0" (S32). The English
article number I in the English-Japanese parallel
corpus 7 is initialized to "1" (S33). The similar
article retrieval unit 2 decides whether the
retrieval processing of similar articles has been
executed for all English articles in the
English-Japanese parallel corpus 7 (S34). The
similar article retrieval unit 2 calculates the
similarity degree S between the English article of
translation object and English article I in the
English-Japanese parallel corpus 7 by the inner
product of the article vectors calculated at
above-mentioned steps (a), (b), and (c). Then, the
similar article retrieval unit 2 decides whether the
similarity degree S is above the threshold P (S37).
If the similarity degree S is not above the
threshold P, the English article I is decided to be
a non-similar article and the processing is
forwarded to S46. The processing from S34 is
repeated for the next English article (I+1). If the
similarity degree S is above the threshold P (S37),
the similar article retrieval unit 2 decides whether
the number K of articles in the arrangement ARRAY is
over the size N (S38). If the number K is not over
the size N, the English article I is added to the
arrangement ARRAY, and the number K of articles is
incremented by "1" (S39).

Next, the similar article retrieval unit 2
decides whether the similarity degree S is below the
minimum L of the similarity degrees of the English
articles in the arrangement ARRAY (S40). If the
similarity degree S is not below the minimum L, the
processing is forwarded to S46. If the similarity
degree S is below the minimum L, the similarity
degree S is assigned to the minimum L (S41) and the
processing is forwarded to S46. If the number K of
articles in the arrangement ARRAY is over the size N
(S38), the similar article retrieval unit 2 decides
whether the similarity degree S is above the minimum
L of the similarity degrees of the articles in the
arrangement ARRAY (S42). If the similarity degree S
is not above the minimum L, the processing is
forwarded to S46 and processing is executed for the
next article. If the similarity degree S is above
the minimum L, the similar article retrieval unit 2
deletes the article of the minimum L from the
arrangement ARRAY (S43), adds the article I to the
arrangement ARRAY (S44), and substitutes the new
minimum of the similarity degrees of the articles in
the arrangement ARRAY for the minimum L (S45). The
processing is forwarded to S46 and similar
processing is repeated for the next article.

When the similar article retrieval unit 2
decides that the calculation of the similarity
degree has been executed for all English articles in
the English-Japanese parallel corpus 7 (S34), the
similar articles in the arrangement ARRAY are output
(S35). In this way, the similarity degree is
calculated for each article, and the K English
articles having the highest similarity degrees are
stored in the arrangement ARRAY. Accordingly, the
memory capacity necessary for the processing is
reduced and high speed processing is executed
without sorting.
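The bounded-memory selection of Figs. 5A and 5B can be sketched as follows. This sketch swaps in a binary min-heap (Python's `heapq`) for the arrangement ARRAY, so the minimum L is always the first heap element instead of a separately tracked variable; the function name and the input format are hypothetical. The final `sorted` call over at most N items is only for readable output and is not part of the described flow.

```python
import heapq


def top_similar(scores, limit, threshold):
    """scores: iterable of (article_id, similarity S); keeps at most `limit`
    articles whose similarity is above `threshold` (P)."""
    heap = []  # min-heap of (similarity, article_id); heap[0] holds the minimum L
    for article_id, s in scores:
        if s <= threshold:                    # S37: not above the threshold P
            continue
        if len(heap) < limit:                 # S38: ARRAY not yet full
            heapq.heappush(heap, (s, article_id))         # S39 to S41
        elif s > heap[0][0]:                  # S42: above the current minimum L
            heapq.heapreplace(heap, (s, article_id))      # S43 to S45
    return sorted(heap, reverse=True)
```

Because only `limit` entries are held at any time, memory use does not grow with the number M of articles in the corpus.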

The electronic information of the similar
article is supplied to the target word information
extraction unit 3. The target word information
extraction unit 3 extracts the English words and the
target words from the similar English article and
the Japanese (translation) article detected by the
similar article retrieval unit 2. In short, as the
target words of each word in the English article
input to the preprocessing unit 1, Japanese words in
the Japanese article corresponding to the similar
English article are utilized. In this case, the
target word information extraction unit 3 detects
the Japanese word into which each English word is
translated from the similar English article and the
corresponding Japanese article, and outputs the
Japanese word as the target word information.

The translation dictionary 8 includes an
English-Japanese dictionary and a Japanese-English
dictionary. The English-Japanese dictionary
correspondingly includes a headword of an English
word, a part of speech, a plural form, a conjugation
form, and the target word (Japanese). The
Japanese-English dictionary correspondingly includes
a headword of a Japanese word, a part of speech, a
conjugation form, and the target word (English).
The target word information extraction unit 3
utilizes the English-Japanese dictionary in the
translation dictionary 8 in order to obtain each
English word from the similar English article and
obtain the target word (Japanese) candidates.
Furthermore, the target word information extraction
unit 3 utilizes the Japanese-English dictionary of
the translation dictionary 8 in order to obtain each
Japanese word from the Japanese article
corresponding to the similar English article and
obtain the target word (English) candidates. The
translation processing unit 5 executes translation
by referring to the English-Japanese dictionary in
the translation dictionary 8.

Fig. 6 is a flow chart of processing of the
target word information extraction unit 3. The
target word information extraction unit 3 utilizes
the translation dictionary 8 including the
English-Japanese dictionary and the Japanese-English
dictionary. First, the target word information
extraction unit 3 obtains each English word of the
similar English article and obtains the target word
candidates (Japanese words) from the
English-Japanese dictionary in the translation
dictionary 8 (S51). Next, the target word
information extraction unit 3 obtains each Japanese
word of the Japanese article corresponding to the
similar English article and obtains the target word
candidates (English words) from the Japanese-English
dictionary in the translation dictionary 8 (S52).
Next, the target word information extraction unit 3
selects the target word candidates (Japanese words)
appearing in the Japanese article from the target
word candidates corresponding to each English word
of the similar English article (S53). Among the
selected target word candidates corresponding to the
English word Em, the target word information
extraction unit 3 regards the target word candidate
appearing most often in the Japanese article as the
Japanese target word Jm of the English word Em, and
creates a set (Em, Jm, Hm) consisting of the English
word Em, the Japanese target word Jm, and the
appearance frequency Hm (S54). Next, the target word
information extraction unit 3 selects the target
word candidates (English words) appearing in the
similar English article from the target word
candidates corresponding to each Japanese word of
the Japanese article (S55). Among the selected
target word candidates corresponding to the Japanese
word Jn, the target word information extraction unit
3 regards the target word candidate appearing most
often in the similar English article as the English
target word En of the Japanese word Jn, and creates
a set (En, Jn, Hn) consisting of the English target
word En, the Japanese word Jn, and the appearance
frequency Hn (S56).

In this way, a correspondence between each
English word in the English article and a Japanese
target word in the Japanese article is estimated.
Next, the target word information extraction unit 3
merges the two kinds of word pairs (Em, Jm, Hm) and
(En, Jn, Hn) (S57). In short, the target word
information extraction unit 3 merges two pairs for
which "Em = En, Jm = Jn" into one word pair
(Em, Jm, Hm+Hn). If a plurality of Japanese words
(different Japanese target words) exist for one
English word, the target word information extraction
unit 3 selects the word pair including the English
word and the maximum frequency, and deletes the other
word pairs including that English word (S58). Last,
the target word information extraction unit 3
outputs each word pair as the target word
information of each English word (S59). In this way,
as for the similar English article, the Japanese
target word of each English word and the frequency
data are obtained as the target word information. By
translating the input English article using this
target word information (the Japanese target words in
the Japanese article corresponding to the similar
English article), translation based on the
classification of the article can be executed.
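
The steps S51 to S59 of Fig. 6 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function name, the dictionary layout (a word mapped to a list of candidates), and the romanized placeholder words in the usage note are hypothetical, and the frequency Hn is assumed to be counted in the similar English article.

```python
from collections import Counter


def extract_target_words(en_words, ja_words, en_ja, ja_en):
    """Illustrative sketch of S51-S59 for one English/Japanese article pair.

    en_words, ja_words: word lists of the similar English article and of
    the corresponding Japanese article (already morphologically analyzed).
    en_ja, ja_en: hypothetical dictionaries mapping a word to candidates.
    Returns {English word: (Japanese target word, frequency)}.
    """
    en_counts = Counter(en_words)
    ja_counts = Counter(ja_words)
    merged = Counter()  # (English word, Japanese word) -> summed frequency

    # S51, S53, S54: English -> Japanese direction, giving (Em, Jm, Hm).
    for em in set(en_words):
        present = [j for j in en_ja.get(em, []) if ja_counts[j] > 0]
        if present:
            jm = max(present, key=lambda j: ja_counts[j])
            merged[(em, jm)] += ja_counts[jm]

    # S52, S55, S56: Japanese -> English direction, giving (En, Jn, Hn).
    for jn in set(ja_words):
        present = [e for e in ja_en.get(jn, []) if en_counts[e] > 0]
        if present:
            en = max(present, key=lambda e: en_counts[e])
            # S57: identical (English, Japanese) pairs are merged as Hm + Hn.
            merged[(en, jn)] += en_counts[en]

    # S58: keep only the most frequent Japanese word per English word.
    best = {}
    for (e, j), h in merged.items():
        if e not in best or h > best[e][1]:
            best[e] = (j, h)
    return best  # S59: output as target word information
```

For example, with the hypothetical placeholder data `en_words = ["buy", "share", "buy", "exercise"]` and `ja_words = ["kaimodoshi", "kabu", "kaimodoshi", "koushi"]`, the pair for "buy" is found in both directions and its frequencies are added together.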

As a method for extracting the target word
information from the parallel corpus, various
methods can be taken into consideration. In the
algorithm of Fig. 6, even if the arrangement and
structure of sentences in the English article are
different from those of the Japanese article,
effective target word information can be obtained.
For example, in case of a translated article of a
newspaper, the sentence style and the order of
description content are often different from those
of the original article so that a native reader can
read the translated article easily. Accordingly, the
algorithm of Fig. 6 is suitable for translation of
newspaper articles.

As the processing of the target word information
extraction unit 3, various modifications can be
considered. For example, as one modification
example, the preprocessing unit 1 extracts each
English word from the English article as the
translation object, and the target word (Japanese
word) is extracted only for each of these English
words. This extraction processing of the target word
information can be executed at high speed.
Furthermore, as another modification example, in
case of preprocessing, the English article as the
translation object is translated once by the
translation processing unit 5. In this case, the
target word of each English word is extracted and
output to the target word information extraction
unit 3. The extracted target word is set as a
default target word of the English word. Then, the
target word information extraction unit 3 outputs
the target word information excluding the default
target words to the translation processing unit 5.
In this method, only the target word information
which contributes to a change of the target word is
output from the target word information extraction
unit 3, and processing of the translation processing
unit 5 can be executed at high speed.

In Fig. 1, the phrase alignment processing unit
4 contributes to the correct extraction and
translation of a noun phrase. For example, as for a
company name, even if the correct noun phrase (the
company name) is described in the article body, only
a part of the noun phrase is often described in the
headline. In short, a shortened expression or an
abbreviation is often utilized for the headline. In
case of using an ordinary translation dictionary,
correct translation is impossible. Accordingly, the
phrase alignment processing unit 4 calculates a
similarity degree between a phrase (noun phrase) of
the headline and a noun phrase of the article body
(especially, a noun phrase of the head sentence in
the article body), and outputs correspondence
information of phrases indicating the same object
(phrase alignment result). In this way, the
abbreviation in the headline can be correctly
translated.

Fig. 7 is a flow chart of the algorithm of phrase
alignment processing. First, the phrase alignment
processing unit 4 morphologically analyzes the
headline and the article body (or a head sentence of
the article body), and extracts sequences of parts
of speech satisfying a predetermined condition (for
example, the following equation (2)) as noun phrase
candidates (S61, S62). In this case, the phrase
alignment processing unit 4 can also extract the
noun phrase candidates from a syntax analysis
result. However, extraction from the morphological
analysis result can be executed at high speed. The
phrase alignment processing unit 4 describes the
condition of candidate extraction over parts of
speech in advance by a regular expression. The
following equation (2) represents one example of the
condition.

"article ? ( noun / adjective ) * noun" ... (2)

In the equation (2), "?" represents omission of the
part of speech located just before, "(○/□)"
represents "○" or "□", and "*" represents at least
one repetition of the part of speech located just
before (in the equation (2), a noun or an adjective).
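
As an illustration of how a condition such as equation (2) could be applied to a morphological analysis result, the following Python sketch maps each part of speech to one letter and matches a regular expression over the resulting string. The tag names, the one-letter codes, and the function name are hypothetical. The sketch also assumes that "*" allows zero repetitions, as in standard regular-expression notation, so that a single noun is also extracted as a candidate; this is an assumption of the sketch and not a statement of the embodiment.

```python
import re

# Hypothetical one-letter codes so that equation (2) becomes a plain regex.
TAG_CODES = {"article": "A", "noun": "N", "adjective": "J"}

# "article ? ( noun / adjective ) * noun" written over the one-letter codes.
NOUN_PHRASE = re.compile(r"A?[NJ]*N")


def noun_phrase_candidates(tagged):
    """Return noun phrase candidates from a list of (word, part_of_speech)."""
    # Any other part of speech becomes "x", which never matches the pattern.
    codes = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in NOUN_PHRASE.finditer(codes)
    ]
```

Matching over part-of-speech codes instead of a syntax tree keeps the extraction to a single linear scan, which corresponds to the high-speed extraction from the morphological analysis result mentioned above.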


Next, the phrase alignment processing unit 4
extracts a noun phrase candidate corresponding to
the noun phrase of the headline from the article
body, especially the head sentence of the article
body (S63, S64). In this case, as for all
combinations of each noun phrase candidate of the
headline and each noun phrase candidate of the
article body, the phrase alignment processing unit 4
detects coincidence of words (obtained from
morphological analysis) in each combination (S63).
If a coincidence degree (the number of coincident
words/the number of all words in the noun phrase
candidate) between two noun phrase candidates of one
combination is above a predetermined threshold, the
phrase alignment processing unit 4 extracts the two
noun phrase candidates as mutually corresponding
noun phrases (S64). For example, if a noun phrase of
the headline consists of three words, if a noun
phrase candidate of the article body consists of
five words, and if two words in the noun phrase of
the headline coincide with two words in the noun
phrase candidate of the article body, then the
coincidence degree is "2/5". If the threshold is
"1/3", the noun phrase of the headline and the noun
phrase candidate of the article body are extracted
as the same one.
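
The arithmetic of the above example can be checked with the following Python sketch. It assumes, in line with the "2/5" example, that the denominator is the number of words of the noun phrase candidate in the article body; the function names are hypothetical and the body candidate is assumed to be non-empty.

```python
from fractions import Fraction


def coincidence_degree(headline_np, body_np):
    """Coincident words divided by the number of words in body_np (S63)."""
    remaining = list(body_np)
    coincident = 0
    for word in headline_np:
        if word in remaining:
            remaining.remove(word)  # each body word is matched at most once
            coincident += 1
    return Fraction(coincident, len(body_np))


def corresponds(headline_np, body_np, threshold=Fraction(1, 3)):
    """S64: the pair is extracted if the degree is above the threshold."""
    return coincidence_degree(headline_np, body_np) > threshold
```

With a three-word headline phrase and a five-word body candidate sharing two words, the degree is 2/5, which is above the threshold 1/3, so the pair is extracted.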

However, if the number of words of a noun
phrase in the headline is larger than the number of
words of the corresponding noun phrase in the
article body, i.e., if the noun phrase in the
article body is a subset of the noun phrase of the
headline, the noun phrase in the headline is better
for the translation. Accordingly, the phrase
alignment processing unit 4 deletes such a pair of
two noun phrases from the pairs extracted at step
S64 (S65). For example, assume that a noun phrase
in the headline is "S. Korean/ship/fire", and a
corresponding noun phrase in the article body is "S.
Korean/ship/fire" or "ship/fire". In short, the
noun phrase in the article body is the same as or
one part of the noun phrase in the headline. In
this case, if the noun phrase in the headline is
replaced by the noun phrase in the article body
according to the phrase alignment result and used
for translation, original information of the noun
phrase in the headline is lost. Accordingly, such a
pair of two noun phrases is deleted at step S65.

Furthermore, as a noun phrase in the article
body corresponding to a noun phrase in the headline,
a plurality of different noun phrases (such as
different abbreviation methods) are often used.
Accordingly, if a plurality of noun phrases in the
article body are extracted for one noun phrase in
the headline at step S64, the phrase alignment
processing unit 4 extracts the noun phrase of which
the coincidence degree is the highest from the
plurality of noun phrases as the corresponding noun
phrase (S66). Last, the phrase alignment processing
unit 4 outputs each pair of two corresponding noun
phrases (S67). In case of comparison of the noun
phrases, the phrase alignment processing unit 4
utilizes the headword of the dictionary instead of
the appearance form of each word in the article.
However, as for an unknown word, the phrase
alignment processing unit 4 utilizes the appearance
form. Furthermore, the headline often includes many
abbreviated expressions. Accordingly, an
abbreviation in the headline is replaced by the
original headword of the dictionary, and the
headword is utilized for comparison with the article
body. For example, if the headline includes the
expressions "mln" and "bln", the headwords "million"
and "billion" of the dictionary are utilized for
comparison with the article body.
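
Steps S64 to S67, together with the replacement of abbreviations by dictionary headwords, can be sketched in Python as follows. This is a minimal sketch under assumptions: the headword table contains only the two entries named in the text, the function names and the 30% default threshold are chosen for illustration, and the subset test of S65 is approximated by a set comparison.

```python
# Hypothetical excerpt of the dictionary: abbreviation -> headword.
HEADWORDS = {"mln": "million", "bln": "billion"}


def normalize(phrase):
    """Replace abbreviations by headwords; unknown words keep their form."""
    return [HEADWORDS.get(word, word) for word in phrase]


def degree(headline_np, body_np):
    """Coincident words divided by the number of words in body_np."""
    body = list(body_np)
    hits = 0
    for word in headline_np:
        if word in body:
            body.remove(word)
            hits += 1
    return hits / len(body_np)


def align(headline_nps, body_nps, threshold=0.30):
    """Return {headline noun phrase: corresponding body noun phrase}."""
    result = {}
    for h in headline_nps:
        hn = normalize(h)
        best = None
        for b in body_nps:
            bn = normalize(b)
            d = degree(hn, bn)
            if d <= threshold:                 # S64: must be above threshold
                continue
            if set(bn) <= set(hn):             # S65: same as or part of headline
                continue
            if best is None or d > best[0]:    # S66: keep the highest degree
                best = (d, b)
        if best is not None:
            result[tuple(h)] = best[1]         # S67: output the pair
    return result
```

Because "mln" is normalized to "million" before the comparison, a body phrase that merely spells out the headline phrase is recognized as identical and dropped at S65 instead of replacing it.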

In the coincidence detection at step S63, the
phrase alignment processing unit 4 uses an algorithm
shown in Fig. 8. Fig. 8 is a flow chart of
abbreviation estimation processing in case that the
headline includes an abbreviated expression. For
example, if an unknown word "HKMA" appears in the
headline and a noun phrase "Hong Kong/Monetary
Authority" appears in the article body, the unknown
word is decided to correspond to the noun phrase.
In this case, "/" represents a word boundary, and
characters from "/" to the next "/" represent an
entry of the dictionary.

First, the phrase alignment processing unit 4
divides a noun phrase of the headline into separate
words (S71). In the noun phrase, the abbreviation
is described alone or described by connecting to
other words. At step S71, if the noun phrase of the
headline includes a space or a hyphen, the phrase
alignment processing unit 4 divides the noun phrase
at the position of the space or the hyphen. The
divided words are regarded as a word sequence A.
The phrase alignment processing unit 4 divides the
noun phrase of the article body (or the head
sentence in the article body) into separate words
(S72). The divided words are regarded as a word
sequence B.

Next, the phrase alignment processing unit 4
decides whether at least one English word in the
word sequence A consists of capital letters only
(S73). If at least one English word consists of only
capital letters, this English word is added to an
abbreviation candidate arrangement RA (S74). Next,
the phrase alignment processing unit 4 decides
whether a word series in the word sequence B
consists of consecutive words each including a
capital letter at the head position (S75). If so,
the phrase alignment processing unit 4 creates a
character series by connecting the capital letters
beginning each word, and adds the character series
to an abbreviation candidate arrangement RB (S76).
The phrase alignment processing unit 4 counts the
number of coincident words between the word
sequences A and B (S77). This processing is the
same processing as step S63 in Fig. 7. Furthermore,
an abbreviation in the abbreviation candidate
arrangement RA is decided to be the same as the
original word series of the abbreviation in the
abbreviation candidate arrangement RB. Accordingly,
the phrase alignment processing unit 4 counts the
same entries of the abbreviation between the
abbreviation candidate arrangements RA and RB, and
adds the counted value to the number of coincident
words (S78). In this way, by utilizing the algorithm
shown in Fig. 8, a pair of corresponding noun
phrases between the headline and the article body
can be obtained by considering the abbreviation.
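
The abbreviation estimation of Fig. 8 (S71 to S78) can be sketched in Python as follows. This is a hedged sketch: the function names are hypothetical, noun phrases are passed as plain strings, and only maximal runs of two or more capitalized words are turned into a character series, which is one possible reading of step S75.

```python
import re


def split_words(noun_phrase):
    """S71/S72: divide a noun phrase at spaces and hyphens."""
    return [w for w in re.split(r"[ \-]+", noun_phrase) if w]


def initial_runs(words):
    """S75/S76: join the initials of each run of capitalized words."""
    runs, current = [], []
    for w in words + [""]:  # the empty sentinel flushes the last run
        if w[:1].isupper():
            current.append(w[0])
        else:
            if len(current) >= 2:
                runs.append("".join(current))
            current = []
    return runs


def coincident_count(headline_np, body_np):
    """Number of coincident words, counting a matched abbreviation as one."""
    a = split_words(headline_np)                        # word sequence A
    b = split_words(body_np)                            # word sequence B
    ra = [w for w in a if w.isalpha() and w.isupper()]  # S73/S74
    rb = initial_runs(b)                                # S75/S76
    remaining = list(b)
    count = 0
    for w in a:                                         # S77 (same as S63)
        if w in remaining:
            remaining.remove(w)
            count += 1
    count += sum(1 for abbr in ra if abbr in rb)        # S78
    return count
```

With this sketch, the unknown word "HKMA" and the noun phrase "Hong Kong Monetary Authority" share no word literally, but the abbreviation match contributes one coincidence.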

Furthermore, by extending the abbreviation
estimation processing in Fig. 8, for example, the
phrase alignment processing unit 4 can estimate that
the abbreviation "MITI" corresponds to "the Ministry
of International Trade and Industry". In this case,
the phrase alignment processing unit 4 creates an
abbreviation candidate by deleting an article, a
conjunction and a preposition located just before a
capital-letter word or put between two
capital-letter words, and adds the abbreviation
candidate to the abbreviation candidate arrangement
RB. Furthermore, for example, if a word "Alexander"
is included in the headline and a word "Alexander
the Great" is included in the article body, these
two words are decided to partially correspond. In
short, as for a noun phrase including a space or a
hyphen, the noun phrase is divided at the space or
the hyphen, and each divided unit is regarded as one
noun. In this way, by the phrase alignment
processing unit 4, a noun phrase in the headline is
replaced by a suitable noun phrase in the article
body, and the noun phrase in the headline is
correctly translated.
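
The extension for abbreviations such as "MITI" can be sketched in Python as follows. The list of function words is a small hypothetical sample, and the function name is an assumption; a real system would presumably decide articles, conjunctions and prepositions from the morphological analysis result instead of a fixed list.

```python
# Hypothetical sample of articles, conjunctions and prepositions.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "or", "for", "in", "on", "to"}


def abbreviation_candidates(words):
    """Join initials of capitalized words, ignoring function words.

    Function words located before or between capitalized words are
    skipped, so they neither contribute a letter nor break the run.
    """
    candidates, current = [], []
    for w in words + [""]:  # the empty sentinel flushes the last run
        if w.lower() in FUNCTION_WORDS:
            continue
        if w[:1].isupper():
            current.append(w[0])
        else:
            if len(current) >= 2:
                candidates.append("".join(current))
            current = []
    return candidates
```

For the word sequence "the Ministry of International Trade and Industry said", the sketch yields the single candidate "MITI", which can then be compared with the capital-letter word of the headline.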

In Fig. 1, the translation processing unit 5
executes translation using the target word
information as the output result of the target word
information extraction unit 3 and the phrase
alignment result as the output result of the phrase
alignment processing unit 4. In short, in case of
translating an English word in the English article
as the translation object, if the English word is
included in the target word information extracted by
the target word information extraction unit 3, the
translation processing unit 5 gives priority to the
corresponding Japanese word as the target word.
Furthermore, by using the phrase alignment result
(correspondence information of noun phrases) from
the phrase alignment processing unit 4, the
translation processing unit 5 replaces (supplements)
a noun phrase fragment in the headline with the
corresponding noun phrase in the article body.
Furthermore, by using the preprocessing result, the
translation processing unit 5 suitably translates
the headline and the article body. For example, in
case of translating the headline, the translation
processing unit 5 translates by applying a
translation rule for the headline, for example, a
rule by which the translated sentence is concluded
by a substantive.

Next, an operation of the present embodiment is
explained. Assume that an article including the
following <English article translation object> is
input.

<English article translation object>
Dissss to buy back up to 95 mln shares 

BUUBANK, Calif., April 23 (Reete) - Waaa Dissss 
Co said its board had approved a stock repurchase 
program of up to 95 million shares. 

The program replaces a similar program that was 
in place prior to its acquisition of Caapii 
Citti/AAC, it said on Monday. 

The preprocessing unit 1 extracts the headline
and the article body from the input article. The
headline and the article body are supplied to the
similar article retrieval unit 2 as the
preprocessing result. In the above-mentioned
<English article translation object>, "Dissss to ...
shares" is the headline, and "BUUBANK, ... Monday."
is the article body.

In the English-Japanese parallel corpus 7, a
plurality of English articles of various fields and
a plurality of Japanese articles as the translation
of each English article are correspondingly stored.
The similar article retrieval unit 2 morphologically
analyzes the English article of translation object
and each English article in the English-Japanese
parallel corpus 7 by referring to the analysis
dictionary 6, generates each word vector of the
English article of translation object and each
English article, and retrieves one English article
similar to the English article of the translation
object from the English-Japanese parallel corpus 7.
The one English article having the highest
similarity degree in the English-Japanese parallel
corpus 7 is decided as the article similar to the
English article translation object.
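
The retrieval by word vectors can be illustrated with the following Python sketch. The passage above does not state how the similarity degree between two word vectors is calculated, so the sketch assumes term-frequency vectors compared by cosine similarity; this measure, the function names, and the corpus layout are assumptions made only for illustration.

```python
import math
from collections import Counter


def word_vector(words):
    """Term-frequency vector of a morphologically analyzed article."""
    return Counter(words)


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors (assumed)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
        math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def most_similar(query_words, corpus):
    """corpus: list of (english_words, japanese_article) pairs.

    Returns the pair whose English article is most similar to the query.
    """
    q = word_vector(query_words)
    return max(corpus, key=lambda pair: cosine(q, word_vector(pair[0])))
```

Because each corpus entry carries the Japanese article together with the English one, the corresponding translation is available as soon as the most similar English article is found.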

Assume that the retrieval result by the similar
article retrieval unit 2 is the following
<Retrieval result of similar article>. In short,
the following retrieval result is an example
(only the headline is shown) of English articles
similar to the <English article translation object>
in the English-Japanese parallel corpus 7. The
following retrieval results are arranged in order of
higher similarity degree.



<Retrieval result of similar article>
SIMILARITY DEGREE    HEADLINE
0.582435250282288    Notwet to buy back up to 2 mln shares
0.574999988079071    Cisss increases buyback program
0.529697775840759    Deel Computer increases share buyback
0.505964457988739    Micoot Inc bought back 164,500 shares
0.464757978916168    PainWer increases share buyback plan
0.461880236864090    Gillee sets 10-15 mln share buyback
0.444467127323151    Campbee heir continues share sale
0.433333337306976    Texxxa has bought 1.5 mln shrs
0.427617967128754    AMM to buy back up to 20 mln of its shares

The following <Similar article> represents the
English article of which the similarity degree is
the highest in <Retrieval result of similar article>
and the corresponding translated article in the
English-Japanese parallel corpus 7.

<Similar article>
<English article>
Notwet to buy back up to 2 mln shares

MINNEAPOLIS, Dec 6 (Reete) - Notwet Airlines
Corp said Friday its board had approved a program to 
buy back up to two million shares of Class A common 
stock. The repurchases will occur from time to time 
in the open market or through negotiated 
transactions, the airline said. Shares repurchased 
under the program would offset dilution resulting 
from the exercise of employee stock options, the 
company said. As of October 31, Notwet had 
90,000,000 common shares outstanding (100,000,000 on 
a fully distributed and diluted basis), the company 
said.

<Japanese article>
(See Fig. 12A)

The similar article retrieval unit 2 outputs
electronic information of each similar article of
which the similarity degree is above a threshold to
the target word information extraction unit 3. The
target word information extraction unit 3 extracts
the target words of English words in the similar
article by referring to the translation dictionary
8. For example, as for the Japanese candidates (the
target words) of the English word "exercise" in the
<English article> of <Similar article>, the
translation dictionary stores (See Fig. 12B). On the
other hand, only (See Fig. 13(1)) is included in the
above-mentioned <Japanese article>. Accordingly, the
target word information extraction unit 3 selects
(See Fig. 13(2)) as the target word of "exercise".
In the same way, as for the English candidates (the
target words) of the Japanese word (See Fig. 13(3))
in the <Japanese article>, the translation
dictionary stores "repurchase/redeem/buy". In this
case, only "buy" is included in the above-mentioned
<English article>. Accordingly, the target word
information extraction unit 3 selects (See Fig.
13(4)) as the target word of "buy". In this way, the
target word information extraction unit 3 selects
the following <Target word information> for the
above-mentioned <Similar article>. In the following
<Target word information>, "(...)" represents a part
of speech of English, "(n)" represents a noun, "(v)"
represents a verb, and [illegible] represents a part
of speech of Japanese.



<Target word information>
repurchase (n)
exercise (n) → (See Fig. 13(15))
employee (n) → (See Fig. 13(16))
stock option (n) → (See Fig. 13(17))
dilute (v) → (See Fig. 13(18))



In the above-mentioned <Target word information>,
an extraction example of the target word information
for the <Similar article> of which the similarity
degree is the highest is explained. However, in the
same way, extraction processing of the target word
information is actually executed for all similar
articles in <Retrieval result of similar article>.

On the other hand, the phrase alignment
processing unit 4 inputs electronic information of
characters of the headline and the article body as
the preprocessing result. The phrase alignment
processing unit 4 executes phrase alignment
processing for characters of the headline and
characters of the article body. First, the phrase
alignment processing unit 4 extracts the noun
phrases "Dissss", "back up", "95 mln/shares" from
the headline. Then, the phrase alignment processing
unit 4 extracts the noun phrases "BUUBANK", "Calif",
"April/23", "Reete", "Waaa Dissss/Co", "board",
"stock/repurchase/program", "95 million/shares" from
the head sentence of the article body. Among these
noun phrases, the combinations of two noun phrases
commonly including the same word are "Dissss" and
"Waaa Dissss/Co", and "95 mln/shares" and "95
million/shares". As mentioned above, in case of
calculating the coincidence degree by considering
the headword including a space or a hyphen, the
former is 33% (1/3) and the latter is 100% (3/3). In
this case, "mln" is regarded as "million", but the
latter pair is deleted because "95 million/shares"
is a subset of (the same as) "95 mln/shares". Assume
that the threshold at step S64 in Fig. 7 is 30%. The
phrase alignment processing unit 4 outputs the
following <Phrase alignment result>.

<Phrase alignment result>
Dissss → Waaa Dissss/Co






The above-mentioned <Target word information>
and <Phrase alignment result> are supplied to the
translation processing unit 5. The translation
processing unit 5 executes translation by using not
only the English-Japanese dictionary of the
translation dictionary 8 but also <Target word
information> and <Phrase alignment result>. The
following <Used target word> represents the target
word information used by the translation processing
unit 5 for translation of <English article
translation object> in the target word information
extracted by the target word information extraction
unit 3. In this case, the following <Used target
word> contributes to a change of the general target
word stored in the English-Japanese dictionary.
Concretely, as for the target word of the English
word at the left edge, a general target word based
on the translation dictionary 8 shown at the left
side of an arrow (→) is changed to a special target
word at the right side of the arrow. "(...)"
represents a part of speech of English, "(n)"
represents a noun, "(v)" represents a verb, and
[illegible] represents a part of speech of Japanese.



<Used target word>
board(n)   (See Fig. 13(19)) → (See Fig. 13(20))
buy(v)     (See Fig. 13(21)) → (See Fig. 13(22))
program(n) (See Fig. 13(23)) → (See Fig. 13(24))
say(v)     (See Fig. 13(25)) → (See Fig. 13(26))
stock(n)   (See Fig. 13(27)) → (See Fig. 13(28))

The following <Translation result> represents the
translated article of <English article translation
object> obtained by using the above-mentioned <Used
target word> and <Phrase alignment result>. In the
following <Translation result>, in order to compare
with a case not using <Target word information>
(<Used target word>) and <Phrase alignment result>,
an ordinary translation result (<Prior translation>)
using the translation dictionary 8 only and a
special translation result (<Application
translation>) using <Target word information> and
<Phrase alignment result> are shown by unit of one
sentence. Furthermore, the different part between
<Prior translation> and <Application translation> is
marked up.

<Translation result>

Headline: Dissss to buy back up to 95 mln
shares

<Prior translation>: (See Fig. 14A)
<Application translation>: (See Fig. 14B)

Original sentence 1: BUUBANK, Calif., April 23
(Reete) - Waaa Dissss Co said its board had approved
a stock repurchase program of up to 95 million
shares.

<Prior translation>: (See Fig. 14C)
<Application translation>: (See Fig. 14D)

Original sentence 2: The program replaces a
similar program that was in place prior to its
acquisition of Caapii Citti/AAC, it said on Monday.

<Prior translation>: (See Fig. 15A)
<Application translation>: (See Fig. 15B)

As shown in the above-mentioned <Translation
result>, in the headline, (See Fig. 16(1)) of
<Prior translation> is changed to (See Fig. 16(2))
of <Application translation> as a more exact company
name. In the article body, as for the target word of
"stock", (See Fig. 16(3)) of <Prior translation>
is changed to (See Fig. 16(4)) of <Application
translation>. As for the target word of "board",
(See Fig. 16(5)) of <Prior translation> is changed
to (See Fig. 16(6)) of <Application translation>.
As a whole, suitable target words are used.
Furthermore, by utilizing the target word
information, the target word of the headline is
expected to be improved.

The style of an English headline is unique, and a
suitable translation sentence is often not obtained
by regular translation. Accordingly, the
translation processing unit 5 prepares a translation
rule for the headline's exclusive use, and applies
the translation rule in case of translating the
headline only. The following <Headline application
translation> represents a special translation
result (<Application translation>) using the
translation rule for the headline's exclusive use
and an ordinary translation result (<Prior
translation>) for original sentences R1 to R4.

<Headline application translation>

Original sentence R1: PLO arrests 90 Arabs in
Gaza-Jericho crackdown
<Prior translation>: (See Fig. 17A)
<Application translation>: (See Fig. 17B)

Original sentence R2: Interactive tv to offer
viewers new powers
<Prior translation>: (See Fig. 17C)
<Application translation>: (See Fig. 17D)

Original sentence R3: Indian 1994/95 GDP seen
rising 5.3 pct - Sharma
<Prior translation>: (See Fig. 17E)
<Application translation>: (See Fig. 17F)

Original sentence R4: Chechen conflict may
overshadow CIS summit
<Prior translation>: (See Fig. 17G)
<Application translation>: (See Fig. 17H)

The example of original sentence R1 is an
applicable example of the substantive conclusion
rule. In case that a verb at the end of a sentence
is (See Fig. 18(1)), a subject particle (See Fig.
18(2)) is changed to 「、」 except for (See Fig.
18(3)) at the end of a sentence. In case that an
object of the verb is not included in the sentence,
if the subject particle (See Fig. 18(4)) is changed
to 「、」, the translated sentence becomes unnatural.
Accordingly, in this case, the substantive
conclusion rule is not applied.

The example of original sentence R2 is an
applicable example of the translation rule of "to".
By using this rule, the order of the target words
becomes more natural in the translation sentence.

The example of original sentence R3 is an
applicable example of the translation rule of
"seen".

The example of original sentence R4 is an
applicable example of the translation rule of "may".

In this way, by applying the translation rule
for the headline's exclusive use, the translation
sentence becomes more natural. In this case, if
this translation rule is applied to the article
body, the translation sentence of the article body
conversely becomes unnatural. Accordingly, it is
necessary that the headline and the article body are
distinguished by the preprocessing and that the
translation rule is applied to the headline only.

In the above-mentioned example of original
sentence R3, a change of the target word of "Sharma"
is based on information source processing of news
explained afterwards. Furthermore, in the example
of original sentence R4, a change from (See Fig.
18(5)) to (See Fig. 18(6)) of the target word of
"summit" is based on the target word information
from the target word information extraction unit 3.

In the above-mentioned example, translation
processing of one article was explained. However,
if one document includes a plurality of articles,
after the headline and the article body are
extracted from each article, the similar article
retrieval processing, the target word information
extraction processing, the phrase alignment
processing, and the translation processing are
executed for each article.

In the headline of English news, the information
source of the news is often shown at the end of the
sentence. If such a headline is translated by an
ordinary method, a correct translation result often
cannot be obtained. Accordingly, it is decided
whether a word at the end of the headline is the
information source of the news by referring to the
head sentence of the article body. If the word at
the end of the headline is the information source, a
translation method of dividing the headline at the
word is applied. In this processing, both the
headline and the head sentence of the article body
are referred to. Accordingly, the phrase alignment
processing unit 4 preferably executes this
processing in parallel with the phrase alignment
processing.

Fig. 9 is a flow chart of an algorithm for
information source processing. First, the phrase
alignment processing unit 4 extracts a noun phrase
at the end of the sentence from the headline based
on the morphological analysis result of the
headline, and regards this noun phrase as noun
phrase A (S81). Next, the phrase alignment
processing unit 4 extracts a subject of a verb (for
example, "report", "say", "tell") typically used as
an expression of the information source from the
head sentence of the article body (S82). In short,
a pattern "noun phrase + ("report" or "say" or
"tell")" is compared with the morphological
analysis sequence of the head sentence. In case of
coincidence, the noun phrase of the coincident
pattern in the head sentence is regarded as noun
phrase B. The form of these verbs may be the past
form, the present form, or the perfect form. Next,
the phrase alignment processing unit 4 decides
whether the noun phrase B exists (that is, whether
it was extracted from the head sentence) (S83). In
case of existence of the noun phrase B, the phrase
alignment processing unit 4 decides whether the
noun phrases A and B are included in the phrase
alignment result (S84). In this case, the phrase
alignment result used at step S84 is obtained
without execution of step S65 in Fig. 7. In short,
the noun phrase in the article body may be a subset
of (or the same as) the noun phrase of the headline.
If the noun phrases A and B are included in the
phrase alignment result, the phrase alignment
processing unit 4 decides that the noun phrase A is
the information source part, and outputs this
information to the translation processing unit 5.
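
The flow of Fig. 9 (S81 to S84) may be illustrated
by the following minimal sketch. It is not the
implementation of the phrase alignment processing
unit 4. It assumes that morphological analysis is
replaced by simple regular expressions, that a noun
phrase is a single word, and that the phrase
alignment result is given as a set of (headline
phrase, body phrase) pairs. The function name
detect_information_source and this data layout are
hypothetical.

```python
import re

# Reporting verbs named in the text, in present, past, and perfect forms.
# Assumption: a regular expression stands in for morphological analysis.
VERB = r"(?:has\s+|have\s+)?(?:reports?|reported|says?|said|tells?|told)"
PATTERN = re.compile(r"(\w+)\s+" + VERB + r"\b", re.IGNORECASE)


def detect_information_source(headline, head_sentence, alignment):
    """Return the information source word of the headline, or None.

    alignment is a hypothetical phrase alignment result: a set of
    (headline phrase, body phrase) pairs in lower case.
    """
    words = re.findall(r"\w+", headline)
    if not words:
        return None
    phrase_a = words[-1]                   # S81: phrase at end of headline
    match = PATTERN.search(head_sentence)  # S82: subject of a reporting verb
    if match is None:                      # S83: noun phrase B does not exist
        return None
    phrase_b = match.group(1)
    # S84: both phrases must appear together in the alignment result
    if (phrase_a.lower(), phrase_b.lower()) in alignment:
        return phrase_a
    return None
```

In this sketch the result is None whenever any of
the decisions at S83 or S84 fails, in which case the
headline would be translated by the ordinary method.
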

In the following <Processing result of news
information source>, the translation result
(<Application translation>) of the translation
processing unit 5 is shown for the case in which the
phrase alignment processing unit 4 decides the
information source part using a verb representing
the news information source. In addition, an
ordinary translation result (<Prior translation>)
that does not use the news source detection
processing result is shown.



<Processing result of news information source>
Headline: HKMA nearing full control of HK
banking--analysts

Head sentence of article body: HONG KONG, Feb
10 (Reuter) - The Hong Kong Monetary Authority (HKMA)
will move a step closer to gaining complete control
over the colony's banking system if the Banking
(Amendment) Bill 1995 passes in late February,
analysts said.

<Prior translation>: (See Fig. 19A)

<Application translation>: (See Fig. 19B)

In the above-mentioned <Processing result of
news information source>, "analysts" at the end of
the headline represents the information source of
the article. The reason why this part is the
information source is that "analysts said." is
located at the end of the head sentence of the
article body. In <Prior translation>, this part is
not correctly translated.

By using the algorithm shown in Fig. 9, the noun
"analysts" in the headline is decided as the
information source. This is because the word
"analysts" is located at the end of the headline,
the expression "analysts said" is located at the
end of the head sentence of the article body, and
the same word "analysts" is included in the phrase
alignment result between the headline and the head
sentence of the article body.

The phrase alignment processing unit 4 outputs
the decision result to the translation processing
unit 5. The translation processing unit 5 divides
the headline at this part (the hyphen immediately
before "analysts"), translates each divided noun
phrase, and outputs a final translation of the
headline by connecting each translated noun phrase.
In this way, as shown in <Application translation>,
the translation sentence of the headline becomes a
more suitable expression.
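
The dividing and connecting step described above
may be sketched as follows. This is only an
illustration: the translate argument is a
hypothetical stand-in for the translation
processing unit 5, and the joiner used to connect
the translated parts is an assumption, because the
actual connected form is shown only in Fig. 19B.

```python
def translate_headline(headline, source, translate, joiner=" - "):
    """Divide the headline at the information source, translate each
    part separately, and connect the translated parts.

    translate: hypothetical callable mapping a source phrase to its
               translation.
    joiner:    assumed connector between the translated parts.
    """
    text = headline.rstrip()
    # Without a trailing information source, translate in the ordinary way.
    if not source or not text.endswith(source):
        return translate(headline)
    # Cut immediately before the source and drop the separating hyphens.
    main = text[: len(text) - len(source)].rstrip(" -")
    parts = [part for part in (main, source) if part]
    return joiner.join(translate(part) for part in parts)
```

For example, with the headline above and the source
"analysts", the two parts passed to translate are
"HKMA nearing full control of HK banking" and
"analysts".
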



As mentioned above, in the present embodiment,
the headline and the article body are respectively
detected, the target word information and the noun
phrase are correctly extracted, and the headline and
the article body are appropriately translated. As a
result, translation accuracy greatly improves.
Concretely, as for the headline, by applying a
translation rule for the headline's exclusive use,
the translation sentence of the headline becomes
more natural. As for a fragmental noun such as the
name of a person or a company abbreviated in the
headline, it can be translated as the correct target
word (not the abbreviation) by the phrase alignment
processing for the noun phrase in the article body.
As a result, translation quality of the headline
improves. Furthermore, by suitably adding
information not included in the headline, the
target word of the headline can be easily read and
understood by a subscriber. Furthermore, the target
word information extracted from the retrieved
similar article is utilized. As a result, the
target word accuracy of the headline and the article
body improves.






In the above-mentioned embodiment, an example
of an English to Japanese translation was explained.
However, the basic concept can be applied to
translation between other languages such as Japanese
to English, German to English, French to English,
Chinese to English, Russian to English, etc.
Furthermore, in the above-mentioned embodiment,
extraction processing of the target word information
using an English-Japanese parallel corpus was
explained. However, a single language corpus of the
target language can be utilized. For example, a
Japanese article corpus is prepared for
English-Japanese translation. After an English
article of the translation object is normally
translated, a Japanese article similar to the
translation result is retrieved from the Japanese
article corpus by using the above-mentioned method.
Then, the extraction processing of target word
information is executed for the retrieved Japanese
article and the English article of the translation
object. By using the target word information, the
English article is translated again. Furthermore,
as a modification of this method, in case of
retrieving a Japanese article similar to the English
article of the translation object, the target word
candidates of each word in the English article are
obtained by referring to an English-Japanese
dictionary, and the similar Japanese article is
retrieved from the corpus by using the target word
candidates. This method is disclosed in the above
reference (1). In this method, the English article
of the translation object is translated only one
time, and the processing can be executed at a high
speed. In general, creation of a single language
corpus is easier than creation of a parallel corpus.
Accordingly, the method using the single language
corpus is advantageous from this point.
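
The modification using target word candidates may
be sketched as follows. This is a simplified
illustration and not the method of reference (1):
it assumes that each Japanese article is already
segmented into words, and that similarity is the
number of target word candidates the article
contains. The function name, the dictionary layout,
and the corpus layout are hypothetical.

```python
def retrieve_similar_article(english_words, dictionary, corpus):
    """Return the ID of the Japanese article sharing the most target
    word candidates with the English article, or None if none match.

    dictionary: English word -> list of Japanese target word candidates.
    corpus:     article ID -> list of words of a Japanese article.
    Assumption: similarity is a plain overlap count.
    """
    # Collect every target word candidate of every English word.
    candidates = set()
    for word in english_words:
        candidates.update(dictionary.get(word, []))

    best_id, best_score = None, 0
    for article_id, words in corpus.items():
        score = len(candidates & set(words))
        if score > best_score:
            best_id, best_score = article_id, score
    return best_id
```

Because only dictionary lookup is needed before the
retrieval, the English article itself is translated
once, after the target word information has been
extracted from the retrieved article.
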



Fig. 10 is a block diagram of the translation
system according to another embodiment of the
present invention. In Fig. 10, as for the same
component elements as in Fig. 1, the same sign is
assigned and explanation is omitted. In this
embodiment, in comparison with Fig. 1, instead of
the similar article retrieval unit 2, the
English-Japanese corpus 7 and the translation
dictionary 8, a similar article retrieval/target
word extraction unit 12, an English-Japanese
parallel corpus 11, and a translation dictionary 13
are respectively adopted. As a result, the target
word information extraction unit 3 in Fig. 1 is
omitted. The translation dictionary 13 is a
dictionary in which the Japanese-English translation
dictionary is deleted from the translation
dictionary 8. As a drawback of the target word
information extraction algorithm in Fig. 6, a target
word not registered in the translation dictionary
cannot be extracted from the article. Accordingly,
as for a pair of an English article and a Japanese
article in the English-Japanese parallel corpus 11,
after the target word information is extracted from
the English article and the Japanese article,
deletion of unsuitable target words and addition of
insufficient target words are properly executed in
order to modify the target word information. Then,
the modified target word information is previously
stored in the English-Japanese parallel corpus 11 in
correspondence with the identifier of the article
including the original target word. Fig. 11 is a
schematic diagram of the components of the
English-Japanese parallel corpus in Fig. 10. As
shown in Fig. 11, the target word (Japanese word) of
each English word is stored in correspondence with
the English article ID including each English word.

In this case, the similar article
retrieval/target word extraction unit 12 directly
retrieves the target word information (Japanese
word) of each English word corresponding to the
English article ID retrieved as the similar article
ID. As a result, extraction processing of the
target word information is not necessary.
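
The direct retrieval described above may be
illustrated by the following sketch. The actual
layout of Fig. 11 is not reproduced here. The
sketch assumes that the stored target word
information is a mapping from an English article ID
to pairs of English word and Japanese target word;
the function name and the sample entry are
hypothetical.

```python
def lookup_target_words(similar_article_id, english_words, store):
    """Return the stored target word of each English word for the
    article retrieved as the similar article.

    store: English article ID -> {English word: Japanese target word}.
    Words without a stored target word are omitted from the result.
    """
    stored = store.get(similar_article_id, {})
    return {word: stored[word] for word in english_words if word in stored}


# Hypothetical stored target word information for one article.
STORE = {"E001": {"summit": "首脳会議", "bank": "銀行"}}
```

Since the target words are stored in advance, the
lookup replaces the extraction processing that the
target word information extraction unit 3 performs
in Fig. 1.
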

As mentioned above, in this embodiment, in
addition to high speed processing, it is not
necessary that each pair of English article and
Japanese article is stored in the English-Japanese
parallel corpus. Accordingly, necessary memory
capacity can be greatly reduced.


A memory can be used to store instructions for
performing the process described above. Such a
memory can be a CD-ROM, floppy disk, hard disk,
magnetic tape, semiconductor memory, and so on.

Other embodiments of the invention will be
apparent to those skilled in the art from
consideration of the specification and practice of
the invention disclosed herein. It is intended that
the specification and examples be considered as
exemplary only, with the true scope and spirit of
the invention being indicated by the following
claims.